Experimentation

Loading Required Libraries

import pandas as pd
import numpy as np
import requests
import json
import os
import mlflow
import datetime
import plotly.graph_objects as go
from great_tables import GT


from statsforecast import StatsForecast
from statsforecast.models import (
    HoltWinters,
    CrostonClassic as Croston,
    HistoricAverage,
    DynamicOptimizedTheta,
    SeasonalNaive,
    AutoARIMA,
    AutoETS,
    AutoTBATS,
    MSTL,
)

from mlforecast import MLForecast
from mlforecast.target_transforms import Differences
from mlforecast.utils import PredictionIntervals
from window_ops.expanding import expanding_mean
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from utilsforecast.plotting import plot_series
from statistics import mean

Data

Loading metadata:

# Read the settings file; the context manager closes the handle once the JSON is parsed.
with open("../settings/settings.json") as raw_json:
    meta_json = json.load(raw_json)

meta_path = meta_json["meta_path"]
data_path = meta_json["data"]["data_path"]
series_mapping_path = meta_json["data"]["series_mapping_path"]
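
The keys read here and later in this notebook imply a nested layout for `settings.json`. A minimal sketch of that layout follows; every path and every nested value below is a hypothetical placeholder, and only the key names come from the notebook.

```python
import json

# Key names mirror the lookups in this notebook; all values are hypothetical placeholders.
example_settings = {
    "meta_path": "path/to/metadata",
    "data": {
        "data_path": "path/to/data.csv",
        "series_mapping_path": "path/to/series_mapping.json",
    },
    "backtesting": {
        "settings": {},  # backtesting options (contents are not shown in the notebook)
        "models": {"model1": {}, "model2": {}, "model3": {}, "model4": {}},
        "leaderboard_path": "path/to/leaderboard.csv",
    },
}

# Round-trip through JSON text to mimic reading the file from disk.
meta_json = json.loads(json.dumps(example_settings))

data_path = meta_json["data"]["data_path"]
model_labels = list(meta_json["backtesting"]["models"].keys())
print(data_path)
print(model_labels)
```

Keeping paths and model definitions in one file means the notebook itself does not need to change when the data location or the candidate models do.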

Loading the dataset:

# low_memory=False parses the file in a single pass, which avoids the mixed-type DtypeWarning.
df = pd.read_csv(data_path, low_memory=False)
ts = df[["period", "subba", "y"]].copy()
ts["ds"] = pd.to_datetime(ts["period"])
# Use the column names StatsForecast and MLForecast expect by default: unique_id, ds, y.
ts = ts[["ds", "subba", "y"]].rename(columns={"subba": "unique_id"})

GT(ts.head(10))
ds                   unique_id  y
2022-01-01 00:00:00  ZONA       1707.0
2022-01-01 01:00:00  ZONA       1673.0
2022-01-01 02:00:00  ZONA       1644.0
2022-01-01 03:00:00  ZONA       1605.0
2022-01-01 04:00:00  ZONA       1550.0
2022-01-01 05:00:00  ZONA       1487.0
2022-01-01 06:00:00  ZONA       1422.0
2022-01-01 07:00:00  ZONA       1373.0
2022-01-01 08:00:00  ZONA       1336.0
2022-01-01 09:00:00  ZONA       1317.0
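
Models that rely on lags assume evenly spaced timestamps, so it can help to confirm that each series is strictly hourly before modeling. The sketch below runs that check on a small made-up frame with the same `unique_id`, `ds`, `y` columns; the ZONB values and the deliberately missing hour are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy data in the same long format; the 02:00 row for ZONB is deliberately missing.
toy = pd.DataFrame({
    "unique_id": ["ZONA"] * 4 + ["ZONB"] * 3,
    "ds": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00", "2022-01-01 02:00", "2022-01-01 03:00",
        "2022-01-01 00:00", "2022-01-01 01:00", "2022-01-01 03:00",
    ]),
    "y": [1707.0, 1673.0, 1644.0, 1605.0, 900.0, 880.0, 850.0],
})

toy = toy.sort_values(["unique_id", "ds"])
# Time elapsed since the previous observation of the same series (NaT for the first row).
toy["step"] = toy.groupby("unique_id")["ds"].diff()

# Any step other than one hour marks a gap or a duplicated timestamp.
irregular = toy[toy["step"].notna() & (toy["step"] != pd.Timedelta(hours=1))]
print(irregular[["unique_id", "ds", "step"]])
```

An empty `irregular` frame would indicate that every series advances in one-hour steps.
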
# Plot every series as its own line with plotly.
fig = go.Figure()

for series_id in ts["unique_id"].unique():
    d = ts[ts["unique_id"] == series_id]
    fig.add_trace(go.Scatter(x=d["ds"], y=d["y"], name=series_id, mode="lines"))

fig.update_layout(title="The Hourly Demand for Electricity in New York by Independent System Operator")
fig

# The same view with utilsforecast, showing at most 24 * 30 observations (30 days of hourly data) per series.
fig = plot_series(
    ts,
    max_ids=ts["unique_id"].nunique(),
    plot_random=False,
    max_insample_length=24 * 30,
    engine="plotly",
)
fig.update_layout(title="The Hourly Demand for Electricity in New York by Independent System Operator")
fig

Models Settings

Loading the backtesting settings:

import backtesting
bkt_settings = meta_json["backtesting"]["settings"]
models_settings = meta_json["backtesting"]["models"]
leaderboard_path = meta_json["backtesting"]["leaderboard_path"]
models_settings.keys()
dict_keys(['model1', 'model2', 'model3', 'model4'])
bkt = backtesting.backtesting(input=ts,
                              models=models_settings,
                              settings=bkt_settings)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001556 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 828
[LightGBM] [Info] Number of data points in the train set: 271689, number of used features: 6
[LightGBM] [Info] Start training from score 1559.341170
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001690 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 828
[LightGBM] [Info] Number of data points in the train set: 273009, number of used features: 6
[LightGBM] [Info] Start training from score 1558.672466
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001713 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 828
[LightGBM] [Info] Number of data points in the train set: 271953, number of used features: 6
[LightGBM] [Info] Start training from score 1559.180763
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002363 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 828
[LightGBM] [Info] Number of data points in the train set: 273273, number of used features: 6
[LightGBM] [Info] Start training from score 1558.428631
/workspaces/pydata-ny-ga-workshop/experimentation/backtesting.py:35: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/workspaces/pydata-ny-ga-workshop/experimentation/backtesting.py:36: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/workspaces/pydata-ny-ga-workshop/experimentation/backtesting.py:37: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/workspaces/pydata-ny-ga-workshop/experimentation/backtesting.py:38: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004364 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6438
[LightGBM] [Info] Number of data points in the train set: 271689, number of used features: 28
[LightGBM] [Info] Start training from score 1559.341170
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003561 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6438
[LightGBM] [Info] Number of data points in the train set: 273009, number of used features: 28
[LightGBM] [Info] Start training from score 1558.672466
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003766 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6438
[LightGBM] [Info] Number of data points in the train set: 271953, number of used features: 28
[LightGBM] [Info] Start training from score 1559.180763
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003832 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6438
[LightGBM] [Info] Number of data points in the train set: 273273, number of used features: 28
[LightGBM] [Info] Start training from score 1558.428631
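
The `SettingWithCopyWarning` messages in this output point at lines 35 to 38 of `backtesting.py`, where columns are assigned on a DataFrame that pandas considers a slice of another one. That module is not shown here, so the following is only a sketch of the usual remedy on made-up data: take an explicit `.copy()` of the filtered frame before adding columns. The frame, column names, and values are hypothetical.

```python
import pandas as pd

forecasts = pd.DataFrame({
    "unique_id": ["ZONA", "ZONA", "ZONB"],
    "forecast": [1624.5, 1546.1, 910.0],
})

# An explicit copy makes the subset an independent DataFrame, so the
# assignments below cannot be mistaken for writes into `forecasts`.
zona = forecasts[forecasts["unique_id"] == "ZONA"].copy()
zona["label"] = "model3"
zona["partition"] = 1

print(zona)
print(list(forecasts.columns))
```

The original frame keeps its two columns, while the copy carries the added ones.
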
bkt.head()
   unique_id  ds                   y       cutoff               forecast     lower        upper        model          partition  label   type
0  ZONA       2024-11-02 03:00:00  1646.0  2024-11-02 02:00:00  1624.552326  1586.092600  1663.012052  LGBMRegressor  1          model3  mlforecast
1  ZONA       2024-11-02 04:00:00  1574.0  2024-11-02 02:00:00  1546.074202  1485.240246  1606.908158  LGBMRegressor  1          model3  mlforecast
2  ZONA       2024-11-02 05:00:00  1525.0  2024-11-02 02:00:00  1483.073939  1403.047782  1563.100095  LGBMRegressor  1          model3  mlforecast
3  ZONA       2024-11-02 06:00:00  1491.0  2024-11-02 02:00:00  1438.655603  1347.213402  1530.097804  LGBMRegressor  1          model3  mlforecast
4  ZONA       2024-11-02 07:00:00  1465.0  2024-11-02 02:00:00  1406.390304  1317.906730  1494.873878  LGBMRegressor  1          model3  mlforecast
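
Each row of the backtesting output pairs a forecast timestamp `ds` with the `cutoff` that ended its training window, so the first forecast hour is one step after the cutoff. The sketch below shows one way such rolling-origin cutoffs can be laid out. The function name, the 24-hour horizon, the 24-hour spacing between partitions, and the series end time are all assumptions chosen for illustration; the real values live in the backtesting settings.

```python
from datetime import datetime, timedelta

def partition_cutoffs(last_ts, horizon, partitions, step):
    """Return training cutoffs, oldest first (hypothetical helper).

    Each test window covers the `horizon` hours that follow its cutoff.
    """
    cutoffs = []
    for p in range(partitions):
        # The most recent window ends at last_ts; earlier ones shift back by `step` hours.
        window_end = last_ts - timedelta(hours=p * step)
        cutoffs.append(window_end - timedelta(hours=horizon))
    return sorted(cutoffs)

# Hypothetical end of the series, chosen so the first cutoff matches the output above.
last_ts = datetime(2024, 11, 4, 2)
cutoffs = partition_cutoffs(last_ts, horizon=24, partitions=2, step=24)

for number, cutoff in enumerate(cutoffs, start=1):
    first_hour = cutoff + timedelta(hours=1)
    last_hour = cutoff + timedelta(hours=24)
    print(number, cutoff, first_hour, last_hour)
```

Training on data up to the cutoff and scoring only what follows it keeps each partition an honest out-of-sample test.
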
score = backtesting.bkt_score(bkt = bkt)
score

GT(score.score.head(10))
unique_id  label   type        partition  model             mape                  rmse                coverage            model_unique_id
ZONA       model3  mlforecast  1          LGBMRegressor     0.04808381078180457   94.12538805498055   0.75                model3_LGBMRegressor
ZONA       model3  mlforecast  1          XGBRegressor      0.04122291700548332   70.9447987317613    0.8333333333333334  model3_XGBRegressor
ZONA       model3  mlforecast  1          LinearRegression  0.054969464405311814  116.83187311287435  0.7083333333333334  model3_LinearRegression
ZONA       model3  mlforecast  2          LGBMRegressor     0.054025274190122194  90.39527047940915   0.625               model3_LGBMRegressor
ZONA       model3  mlforecast  2          XGBRegressor      0.05206026182409249   86.92895097842786   0.5                 model3_XGBRegressor
ZONA       model3  mlforecast  2          LinearRegression  0.04193650668855765   76.11480265935728   0.7916666666666666  model3_LinearRegression
ZONA       model4  mlforecast  1          LGBMRegressor     0.04745269279367267   102.2861199054818   0.6666666666666666  model4_LGBMRegressor
ZONA       model4  mlforecast  1          XGBRegressor      0.044794728806845666  89.66242612058517   0.5416666666666666  model4_XGBRegressor
ZONA       model4  mlforecast  1          LinearRegression  0.0490540328487123    107.04369540710931  0.7083333333333334  model4_LinearRegression
ZONA       model4  mlforecast  2          LGBMRegressor     0.045004910765600196  79.88943396141732   0.625               model4_LGBMRegressor
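
The `mape`, `rmse`, and `coverage` columns summarize each partition. The implementation of `bkt_score` is not shown, so the functions below are a sketch of the standard definitions rather than the module's code: mean absolute percentage error, root mean squared error, and the share of actual values that fall inside the prediction interval. The inputs are the five rows printed by `bkt.head()`.

```python
from math import sqrt

# (y, forecast, lower, upper) from the five rows printed by bkt.head().
rows = [
    (1646.0, 1624.552326, 1586.092600, 1663.012052),
    (1574.0, 1546.074202, 1485.240246, 1606.908158),
    (1525.0, 1483.073939, 1403.047782, 1563.100095),
    (1491.0, 1438.655603, 1347.213402, 1530.097804),
    (1465.0, 1406.390304, 1317.906730, 1494.873878),
]

def mape(rows):
    """Mean of |actual - forecast| / |actual|."""
    return sum(abs(y - f) / abs(y) for y, f, _, _ in rows) / len(rows)

def rmse(rows):
    """Square root of the mean squared forecast error."""
    return sqrt(sum((y - f) ** 2 for y, f, _, _ in rows) / len(rows))

def coverage(rows):
    """Share of actual values inside [lower, upper]."""
    return sum(lo <= y <= hi for y, _, lo, hi in rows) / len(rows)

print(mape(rows), rmse(rows), coverage(rows))
```

These five rows are only the start of one partition, so the results will not match the table. The coverage values in the table are all multiples of 1/24 (for example 0.75 = 18/24), which is consistent with 24 forecast hours per partition.
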
GT(score.leaderboard.head(10))
model_unique_id          unique_id  label   model             type        partitions  avg_mape              avg_rmse           avg_coverage
model3_LGBMRegressor     ZONA       model3  LGBMRegressor     mlforecast  2           0.05105454248596338   92.26032926719485  0.6875
model3_XGBRegressor      ZONA       model3  XGBRegressor      mlforecast  2           0.04664158941478791   78.93687485509457  0.6666666666666667
model3_LinearRegression  ZONA       model3  LinearRegression  mlforecast  2           0.04845298554693473   96.47333788611581  0.75
model4_LGBMRegressor     ZONA       model4  LGBMRegressor     mlforecast  2           0.04622880177963643   91.08777693344956  0.6458333333333333
model4_XGBRegressor      ZONA       model4  XGBRegressor      mlforecast  2           0.044366760180237774  84.65950275845502  0.625
model4_LinearRegression  ZONA       model4  LinearRegression  mlforecast  2           0.04199975306929239   84.10155074045954  0.6666666666666667
model3_LGBMRegressor     ZONB       model3  LGBMRegressor     mlforecast  2           0.07001534796055241   76.12399398461962  0.6458333333333333
model3_XGBRegressor      ZONB       model3  XGBRegressor      mlforecast  2           0.06261402304400177   67.00706044346474  0.6875
model3_LinearRegression  ZONB       model3  LinearRegression  mlforecast  2           0.08893219198931748   96.47698299416918  0.6875
model4_LGBMRegressor     ZONB       model4  LGBMRegressor     mlforecast  2           0.07769366263884078   89.72593726965465  0.6875
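
The leaderboard appears to average each model's partition scores per series: the two ZONA rows for `model3_LGBMRegressor` in the score table reproduce the corresponding leaderboard entry up to floating-point rounding. The sketch below redoes that aggregation in plain Python on those two rows. The grouping logic is an assumption about what `bkt_score` does, not its actual code.

```python
from collections import defaultdict
from statistics import mean

# (unique_id, model_unique_id, partition, mape, rmse, coverage) copied from the score table.
score_rows = [
    ("ZONA", "model3_LGBMRegressor", 1, 0.04808381078180457, 94.12538805498055, 0.75),
    ("ZONA", "model3_LGBMRegressor", 2, 0.054025274190122194, 90.39527047940915, 0.625),
]

# Collect the per-partition metrics of each (series, model) pair.
groups = defaultdict(list)
for unique_id, model_id, _partition, m, r, c in score_rows:
    groups[(unique_id, model_id)].append((m, r, c))

# Average every metric across the partitions of a pair.
leaderboard = {
    key: {
        "partitions": len(vals),
        "avg_mape": mean(v[0] for v in vals),
        "avg_rmse": mean(v[1] for v in vals),
        "avg_coverage": mean(v[2] for v in vals),
    }
    for key, vals in groups.items()
}
print(leaderboard[("ZONA", "model3_LGBMRegressor")])
```
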
GT(score.top)
model_unique_id          unique_id  label   model             type        partitions  avg_mape              avg_rmse            avg_coverage
model4_LinearRegression  ZONA       model4  LinearRegression  mlforecast  2           0.04199975306929239   84.10155074045954   0.6666666666666667
model4_LinearRegression  ZONB       model4  LinearRegression  mlforecast  2           0.059693724291071074  71.44287678272455   0.7083333333333334
model4_LinearRegression  ZONC       model4  LinearRegression  mlforecast  2           0.07667990292340046   142.12392081140575  0.8333333333333333
model4_XGBRegressor      ZOND       model4  XGBRegressor      mlforecast  2           0.0502338078173399    34.92911205247234   1.0
model4_LGBMRegressor     ZONE       model4  LGBMRegressor     mlforecast  2           0.07676213541047594   62.225834722324834  0.7708333333333333
model4_LGBMRegressor     ZONF       model4  LGBMRegressor     mlforecast  2           0.05050550495267777   70.99907528191949   0.8125
model4_LinearRegression  ZONG       model4  LinearRegression  mlforecast  2           0.07512586274300675   77.84627628021269   0.5208333333333334
model3_XGBRegressor      ZONH       model3  XGBRegressor      mlforecast  2           0.08949055776360956   23.865196728042697  0.625
model3_XGBRegressor      ZONI       model3  XGBRegressor      mlforecast  2           0.04725835475871128   31.2237722148145    0.8958333333333334
model3_XGBRegressor      ZONJ       model3  XGBRegressor      mlforecast  2           0.03746087999724491   192.18373562844218  0.7916666666666667
model3_XGBRegressor      ZONK       model3  XGBRegressor      mlforecast  2           0.06279484434954861   144.5915670000412   0.7291666666666667
score.top.to_csv(leaderboard_path, index = False)
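
For ZONA, the row kept in `score.top` is the leaderboard entry with the lowest `avg_mape`, which suggests the best model per series is picked on that metric. The sketch below applies that rule to the six ZONA leaderboard rows; the selection criterion is inferred from the output, not taken from the `backtesting` module.

```python
# series -> [(model_unique_id, avg_mape, avg_rmse)], copied from the leaderboard output.
candidates = {
    "ZONA": [
        ("model3_LGBMRegressor", 0.05105454248596338, 92.26032926719485),
        ("model3_XGBRegressor", 0.04664158941478791, 78.93687485509457),
        ("model3_LinearRegression", 0.04845298554693473, 96.47333788611581),
        ("model4_LGBMRegressor", 0.04622880177963643, 91.08777693344956),
        ("model4_XGBRegressor", 0.044366760180237774, 84.65950275845502),
        ("model4_LinearRegression", 0.04199975306929239, 84.10155074045954),
    ],
}

# Keep, for every series, the model with the smallest average MAPE.
top_by_mape = {
    series: min(rows, key=lambda row: row[1])[0]
    for series, rows in candidates.items()
}
# The same selection on average RMSE, for comparison.
top_by_rmse = {
    series: min(rows, key=lambda row: row[2])[0]
    for series, rows in candidates.items()
}
print(top_by_mape)
print(top_by_rmse)
```

On these rows the two metrics disagree: the lowest `avg_rmse` for ZONA belongs to `model3_XGBRegressor`, so the choice of ranking metric changes which model is saved to the leaderboard file.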